As the workflow for ATAC-seq QC was established in step 2, I performed this workflow on the entire data set. However, performing pseudo-bulking and normalization and subsequent visualization with the data from all cells is very computationally intensive. Therefore, pseudo-bulking and seurat provided normalization was performed separately from the visualization here.
These are the packages required for the code below.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(binom)
library(broom)
## Warning: Paket 'broom' wurde unter R Version 4.4.3 erstellt
library(Seurat) #for rna-seq data, atac-seq data
## Lade nötiges Paket: SeuratObject
## Warning: Paket 'SeuratObject' wurde unter R Version 4.4.3 erstellt
## Lade nötiges Paket: sp
##
## Attache Paket: 'SeuratObject'
##
## Die folgenden Objekte sind maskiert von 'package:base':
##
## intersect, t
library(Signac) #for atac seq data
First I loaded these data sets into R to work with them. The data sets were extracted by Lauren Saunders based from Lin et al. 2023.
readRDS("Data/Lin_2023_zfish_snATAC-seq/Cell-type_Peak_Matrix.rds") -> zfish_snATAC_seq_pk_mtrx #count matrix
read_csv("Data/Lin_2023_zfish_snATAC-seq/atac_all.metaData.csv")-> zfish_mta_data # meta data
## New names:
## Rows: 50637 Columns: 24
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (7): ...1, Sample, stage, Clusters, celltype, predictedCell, predictedG... dbl
## (17): TSSEnrichment, ReadsInTSS, ReadsInPromoter, ReadsInBlacklist, Prom...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Then, I inspected the data sets with class() and
dim(). The matrix zfish_snATAC_seq_pk_mtrx
contains the peaks as rownames and columns as cell names and counts as
entries. As this is a sparse matrix, the 0s are written as dots. In
addition, the data frame zfish_mta_data contains the meta
data.
unique(zfish_mta_data$Clusters) #how many clusters in total
## [1] "C3" "C1" "C24" "C5" "C4" "C2" "C18" "C7" "C8" "C16" "C25" "C14"
## [13] "C21" "C19" "C12" "C17" "C15" "C10" "C9" "C11" "C23" "C22" "C13"
class(zfish_snATAC_seq_pk_mtrx)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
class(zfish_mta_data)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dim(zfish_snATAC_seq_pk_mtrx)
## [1] 370058 50637
zfish_snATAC_seq_pk_mtrx[1:10, 1:2]
## 10 x 2 sparse Matrix of class "dgCMatrix"
## 24hpf_1#24hpf_1_BC00224_N06 24hpf_1#24hpf_1_BC00014_N03
## chr1:5232-5732 . .
## chr1:5787-6287 . .
## chr1:10088-10588 . .
## chr1:10991-11491 . .
## chr1:11895-12395 . 2
## chr1:12474-12974 1 .
## chr1:14016-14516 . .
## chr1:14704-15204 1 .
## chr1:16672-17172 3 .
## chr1:18404-18904 3 .
zfish_mta_data
For further analysis in this ATAC-Seq-QC to see whether this data set can be used to create a model which predicts chromatin accessibility based on genomic sequence alone, I will create a seurat object with the code below.
zfish_atac=CreateSeuratObject(counts = zfish_snATAC_seq_pk_mtrx, assay = "atac", meta.data = zfish_mta_data) #combines matrix and meta data into a seurat object
## Warning: Setting row names on a tibble is deprecated.
Here is an overview of the seurat object, obtained with the
glimpse function.
glimpse(zfish_atac)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ atac:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## ..@ meta.data :'data.frame': 50637 obs. of 27 variables:
## .. ..$ orig.ident : Factor w/ 7 levels "10hpf","12hpf",..: 4 4 4 4 4 4 4 4 4 4 ...
## .. ..$ nCount_atac : num [1:50637] 30862 26934 24272 24893 29109 ...
## .. ..$ nFeature_atac : int [1:50637] 15150 13872 13368 11756 13537 11722 12082 12192 9450 11516 ...
## .. ..$ ...1 : chr [1:50637] "3hpf_1#3hpf_1_merge_BC0443_N27" "3hpf_1#3hpf_1_merge_BC0069_N07" "3hpf_1#3hpf_1_merge_BC0033_N05" "3hpf_1#3hpf_1_merge_BC0028_N08" ...
## .. ..$ Sample : chr [1:50637] "3hpf_1" "3hpf_1" "3hpf_1" "3hpf_1" ...
## .. ..$ TSSEnrichment : num [1:50637] 6.5 4.21 8.49 5.02 6.57 ...
## .. ..$ ReadsInTSS : num [1:50637] 1452 829 1171 684 1337 ...
## .. ..$ ReadsInPromoter : num [1:50637] 7341 4887 5702 3907 6664 ...
## .. ..$ ReadsInBlacklist : num [1:50637] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..$ PromoterRatio : num [1:50637] 0.0531 0.0356 0.0488 0.0359 0.0616 ...
## .. ..$ PassQC : num [1:50637] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ NucleosomeRatio : num [1:50637] 1.56 1.36 1.3 1.41 1.35 ...
## .. ..$ nMultiFrags : num [1:50637] 16414 15982 11720 12508 10532 ...
## .. ..$ nMonoFrags : num [1:50637] 27022 29078 25398 22610 23029 ...
## .. ..$ nFrags : num [1:50637] 69102 68665 58378 54390 54053 ...
## .. ..$ nDiFrags : num [1:50637] 25666 23605 21260 19272 20492 ...
## .. ..$ BlacklistRatio : num [1:50637] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..$ stage : chr [1:50637] "3hpf" "3hpf" "3hpf" "3hpf" ...
## .. ..$ Clusters : chr [1:50637] "C3" "C3" "C3" "C3" ...
## .. ..$ ReadsInPeaks : num [1:50637] 47511 42150 38848 33031 38019 ...
## .. ..$ FRIP : num [1:50637] 0.344 0.307 0.333 0.304 0.352 ...
## .. ..$ celltype : chr [1:50637] "blastomere" "blastomere" "blastomere" "blastomere" ...
## .. ..$ predictedCell : chr [1:50637] "6h_3 CELL4645_N1 _" "3h1_CELL1337_N1_3h1" "3h1_CELL1337_N1_3h1" "3h1_CELL1197_N1_3h1" ...
## .. ..$ predictedGroup : chr [1:50637] "margin" "blastomere" "blastomere" "blastomere" ...
## .. ..$ predictedScore : num [1:50637] 0.605 0.982 0.986 0.576 0.598 ...
## .. ..$ DoubletScore : num [1:50637] 323 169 323 140 323 ...
## .. ..$ DoubletEnrichment: num [1:50637] 15.3 7.9 16.6 7.1 24.3 3.4 16.2 2.3 7 5.8 ...
## ..@ active.assay: chr "atac"
## ..@ active.ident: Factor w/ 7 levels "10hpf","12hpf",..: 4 4 4 4 4 4 4 4 4 4 ...
## .. ..- attr(*, "names")= chr [1:50637] "24hpf_1#24hpf_1_BC00224_N06" "24hpf_1#24hpf_1_BC00014_N03" "24hpf_1#24hpf_1_BC00075_N02" "24hpf_1#24hpf_1_BC00171_N03" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "SeuratProject"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 0 2
## ..@ commands : list()
## ..@ tools : list()
Next, the data was transformed. Overall, it was pseudo bulked, log
normalized and saved as an RDS file. pseudo-bulking and normalization
was performed with the seurat function AggregateExpression
and return.seurat = T was used to save the counts in a new
pseudo-bulked seurat object psd.bulk.zfish_atac.
#pseudobulking and normalizing
psd.bulk.zfish_atac= AggregateExpression(zfish_atac, group.by = "celltype", normalization.method ="LogNormalize", return.seurat = T )
## Centering and scaling data matrix
#output
glimpse(psd.bulk.zfish_atac)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ atac:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## ..@ meta.data :'data.frame': 23 obs. of 2 variables:
## .. ..$ orig.ident: chr [1:23] "anterior/posterior axis" "blastomere" "central nervous system" "digestive system" ...
## .. ..$ celltype : chr [1:23] "anterior/posterior axis" "blastomere" "central nervous system" "digestive system" ...
## ..@ active.assay: chr "atac"
## ..@ active.ident: Factor w/ 23 levels "anterior/posterior axis",..: 1 2 3 4 5 6 7 8 9 10 ...
## .. ..- attr(*, "names")= chr [1:23] "anterior/posterior axis" "blastomere" "central nervous system" "digestive system" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "Aggregate"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 0 2
## ..@ commands :List of 1
## .. ..$ ScaleData.atac:Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
## ..@ tools : list()
The resulting seurat object was saved as an RDS for further use.
saveRDS(psd.bulk.zfish_atac, file = "Data/psd.bulk.zfish_atac.rds")
Information about the coding environment can be found below.
sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Signac_1.14.0 Seurat_5.2.1 SeuratObject_5.0.2 sp_2.2-0
## [5] broom_1.0.8 binom_1.1-1.1 lubridate_1.9.4 forcats_1.0.0
## [9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4 readr_2.1.5
## [13] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.17.1 jsonlite_1.8.9
## [4] magrittr_2.0.3 spatstat.utils_3.1-2 farver_2.1.2
## [7] rmarkdown_2.29 zlibbioc_1.52.0 vctrs_0.6.5
## [10] ROCR_1.0-11 Rsamtools_2.22.0 spatstat.explore_3.3-4
## [13] RcppRoll_0.3.1 htmltools_0.5.8.1 sass_0.4.10
## [16] sctransform_0.4.1 parallelly_1.43.0 KernSmooth_2.23-26
## [19] bslib_0.9.0 htmlwidgets_1.6.4 ica_1.0-3
## [22] plyr_1.8.9 plotly_4.10.4 zoo_1.8-12
## [25] cachem_1.1.0 igraph_2.1.4 mime_0.12
## [28] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.7-1
## [31] R6_2.6.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [34] fitdistrplus_1.2-2 future_1.40.0 shiny_1.10.0
## [37] digest_0.6.37 colorspace_2.1-1 S4Vectors_0.44.0
## [40] patchwork_1.3.0 tensor_1.5 RSpectra_0.16-2
## [43] irlba_2.3.5.1 GenomicRanges_1.58.0 progressr_0.15.1
## [46] spatstat.sparse_3.1-0 timechange_0.3.0 httr_1.4.7
## [49] polyclip_1.10-7 abind_1.4-8 compiler_4.4.2
## [52] bit64_4.6.0-1 withr_3.0.2 backports_1.5.0
## [55] BiocParallel_1.40.2 fastDummies_1.7.5 MASS_7.3-64
## [58] tools_4.4.2 lmtest_0.9-40 httpuv_1.6.15
## [61] future.apply_1.11.3 goftest_1.2-3 glue_1.8.0
## [64] nlme_3.1-167 promises_1.3.2 grid_4.4.2
## [67] Rtsne_0.17 cluster_2.1.8 reshape2_1.4.4
## [70] generics_0.1.3 gtable_0.3.6 spatstat.data_3.1-6
## [73] tzdb_0.4.0 data.table_1.16.4 hms_1.1.3
## [76] XVector_0.46.0 BiocGenerics_0.52.0 spatstat.geom_3.3-5
## [79] RcppAnnoy_0.0.22 ggrepel_0.9.6 RANN_2.6.2
## [82] pillar_1.10.2 vroom_1.6.5 spam_2.11-1
## [85] RcppHNSW_0.6.0 later_1.4.1 splines_4.4.2
## [88] lattice_0.22-6 bit_4.6.0 survival_3.8-3
## [91] deldir_2.0-4 tidyselect_1.2.1 Biostrings_2.74.1
## [94] miniUI_0.1.1.1 pbapply_1.7-2 knitr_1.50
## [97] gridExtra_2.3 IRanges_2.40.1 scattermore_1.2
## [100] stats4_4.4.2 xfun_0.52 matrixStats_1.5.0
## [103] UCSC.utils_1.2.0 stringi_1.8.4 lazyeval_0.2.2
## [106] yaml_2.3.10 evaluate_1.0.3 codetools_0.2-20
## [109] cli_3.6.1 uwot_0.2.2 xtable_1.8-4
## [112] reticulate_1.40.0 munsell_0.5.1 jquerylib_0.1.4
## [115] Rcpp_1.0.14 GenomeInfoDb_1.42.3 globals_0.17.0
## [118] spatstat.random_3.3-2 png_0.1-8 spatstat.univar_3.1-1
## [121] parallel_4.4.2 dotCall64_1.2 bitops_1.0-9
## [124] listenv_0.9.1 viridisLite_0.4.2 scales_1.3.0
## [127] ggridges_0.5.6 crayon_1.5.3 rlang_1.1.4
## [130] fastmatch_1.1-6 cowplot_1.1.3